ON THE SHRINKAGE BEHAVIOR OF PARTIAL LEAST SQUARES REGRESSION

NICOLE KRAMER 


Abstract. We present a formula for the shrinkage factors of the Partial Least Squares
regression estimator and deduce some of their properties, in particular the known fact that
some of the factors are > 1. We investigate the effect of shrinkage factors on the Mean
Squared Error of linear estimators and illustrate that we cannot extend the results to nonlinear
estimators. In particular, shrinkage factors > 1 do not automatically lead to a poorer Mean
Squared Error. We investigate empirically the effect of bounding the absolute value of
the Partial Least Squares shrinkage factors by 1.
1. Introduction

We investigate the shrinkage properties of the Partial Least Squares (PLS) regression
estimator. It is known (e.g. [2]) that we can express the PLS estimator obtained after m steps in
the following way:

$$\hat\beta^{(m)}_{PLS} = \sum_{i=1}^{p^*} f^{(m)}(\lambda_i) \cdot z_i ,$$

where $z_i$ is the component of the Ordinary Least Squares (OLS) estimator along the ith
principal component of the covariance matrix $X^tX$ and $\lambda_i$ is the corresponding eigenvalue. The
quantities $f^{(m)}(\lambda_i)$ are called shrinkage factors. We show that these factors are determined
by a tridiagonal matrix (which depends on the input-output matrix $(X, y)$) and can be
calculated in a recursive way. Combining the results of Q] and PI, we give a simpler and clearer
proof of the shape of the shrinkage factors of PLS and derive some of their properties. In
particular, we show that some of the values $f^{(m)}(\lambda_i)$ are greater than 1 (this was first proved in [1]).

We argue that these "peculiar shrinkage properties" [1] do not necessarily imply that the
Mean Squared Error (MSE) of the PLS estimator is worse than the MSE of the OLS
estimator: In the case of deterministic shrinkage factors, i.e. factors that do not depend on
the output y, any value $|f^{(m)}(\lambda_i)| > 1$ is of course undesirable. But in the case of PLS, the
shrinkage factors are stochastic - they also depend on y. Even if $P(|f^{(m)}(\lambda_i)| > 1) = 1$ we
cannot conclude that the MSE is worse than the MSE of the OLS estimator. In particular,
bounding the absolute value of the shrinkage factor by 1 does not automatically yield a lower
MSE, in disagreement with what was conjectured in e.g. [2].

Having issued this warning, we explore whether bounding the shrinkage factors leads to a
lower MSE or not. It is very difficult to derive theoretical results, as the quantities of interest -
$\hat\beta^{(m)}_{PLS}$ and $f^{(m)}(\lambda_i)$ respectively - depend on y in a complicated, nonlinear way. As a substitute,
we study the problem on several artificial data sets and one real world example. It turns out
that in most cases the MSE of the bounded version of PLS is indeed smaller than the one of
PLS, although the improvement is tiny.
The paper is organized as follows: In section 2 we introduce the notation and in section 3 we
recall some properties of Krylov spaces. In section 4 we define the PLS estimator and in section
5 we provide mathematical results that are needed in the rest of the paper. After explaining the
notion of shrinkage in section 6 we derive the formulas for the PLS shrinkage factors in section
7 and derive some of their properties. In section 8 we report on the results of the experiments.
The paper ends with a conclusion.


2. Preliminaries 

We consider the multivariate linear regression model
(1) $y = X\beta + \varepsilon$

with

$\mathrm{Cov}(y) = \sigma^2 \cdot \mathrm{Id}$.

The number of variables is p, the number of examples is n. For simplicity, we assume that X
and y are scaled to have zero mean, so we do not have to worry about intercepts. We have


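$$X \in \mathbb{R}^{n \times p}, \qquad A := X^tX \in \mathbb{R}^{p \times p}, \qquad y \in \mathbb{R}^{n}, \qquad b := X^ty \in \mathbb{R}^{p} .$$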


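We set $p^* = \mathrm{rk}(A) = \mathrm{rk}(X)$. The singular value decomposition of X is of the form

$$X = V \Sigma U^t$$

with

$$V \in \mathbb{R}^{n \times p}, \qquad \Sigma = \mathrm{diag}\left(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_p}\right) \in \mathbb{R}^{p \times p}, \qquad U \in \mathbb{R}^{p \times p} .$$

We have $U^tU = \mathrm{Id}_p$ and $V^tV = \mathrm{Id}_p$.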


Set $\Lambda = \Sigma^2$. The eigendecomposition of A is

$$A = U \Lambda U^t = \sum_{i=1}^{p} \lambda_i u_i u_i^t .$$

The eigenvalues $\lambda_i$ of A (and any other matrix) are ordered in the following way:

$$\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0 .$$

The Moore-Penrose inverse of a matrix M is denoted by $M^-$.


The Ordinary Least Squares (OLS) estimator $\hat\beta_{OLS}$ is the solution of the optimization problem

$$\arg\min_{\beta} \|y - X\beta\| .$$


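Set

(2) $t := \Sigma V^t y$.

The OLS estimator is given by the formula

$$\hat\beta_{OLS} = (X^tX)^- X^t y = U \Lambda^- U^t \, U \Sigma V^t y = U \Lambda^- t = \sum_{i=1}^{p^*} \frac{v_i^t y}{\sqrt{\lambda_i}} \, u_i .$$

Set

$$z_i := \frac{v_i^t y}{\sqrt{\lambda_i}} \, u_i .$$

Finally, we need a result on the shape of the Moore-Penrose inverse of a symmetric matrix.

Proposition 1. Let $B$ be a symmetric matrix with eigendecomposition

$$B = S \Lambda S^t ,$$

with eigenvalues $\lambda_i$. Set

$$f_B(\lambda) = 1 - \prod_{\lambda_i \neq 0} \left(1 - \frac{\lambda}{\lambda_i}\right) .$$

As $f_B(0) = 0$ we can write

$$f_B(\lambda) = \lambda \cdot \pi_B(\lambda) .$$

Then

$$B^- = \pi_B(B) .$$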


Proof. The four properties that we have to check are

(1) $(BB^-)^t = BB^-$,

(2) $(B^-B)^t = B^-B$,

(3) $BB^-B = B$,

(4) $B^-BB^- = B^-$.

As B is symmetric, the polynomial $\pi_B(B)$ is symmetric as well, which proves the first two
conditions. Next note that it suffices to prove properties 3 and 4 for the diagonal
matrix

$$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k, 0, \ldots, 0)$$

with $k = \mathrm{rk}(B)$. This is true as

$$B^- = (S \Lambda S^t)^- = S \Lambda^- S^t .$$

We have

$$\Lambda^- = \mathrm{diag}(\lambda_1^{-1}, \ldots, \lambda_k^{-1}, 0, \ldots, 0) .$$

The third property of the Moore-Penrose inverse is $\Lambda = \Lambda \Lambda^- \Lambda$, which is equivalent to
$\lambda_i = \lambda_i \pi_B(\lambda_i) \lambda_i$, which is obviously true. The fourth property follows as easily. □

Remark 2. The degree of the polynomial $\pi_B$ is $\mathrm{rk}(B) - 1$. The proposition is valid no matter
if we count the non-zero eigenvalues with or without multiplicities. We count the eigenvalues
with multiplicities in order to connect the polynomial to the characteristic polynomial in the
regular case: If B is a regular matrix, $\pi_B$ is linked to the characteristic polynomial $\chi_B$ in the
following way:

$$\lambda \cdot \pi_B(\lambda) = -\frac{1}{\chi_B(0)} \chi_B(\lambda) + 1 .$$
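As a small numerical illustration of Proposition 1 in the regular case, the following sketch evaluates $\pi_B$ on the spectrum of a symmetric matrix and checks that it returns the reciprocals of the eigenvalues, which is what $B^- = \pi_B(B)$ amounts to after diagonalization. The spectrum and the function names are hypothetical and chosen only for this example.

```python
from math import prod

def f_B(lam, eigenvalues):
    """f_B(lam) = 1 - product over the non-zero eigenvalues of (1 - lam / lam_i)."""
    return 1.0 - prod(1.0 - lam / e for e in eigenvalues if e != 0)

def pi_B(lam, eigenvalues):
    """pi_B(lam) = f_B(lam) / lam, defined here for lam != 0."""
    return f_B(lam, eigenvalues) / lam

# Hypothetical spectrum of a regular symmetric matrix B (all eigenvalues non-zero).
eigenvalues = [4.0, 2.0, 1.0]

# On each eigenvalue, pi_B should agree with the eigenvalue of the inverse of B.
inverse_values = [pi_B(e, eigenvalues) for e in eigenvalues]
```

The check only covers the non-zero eigenvalues; it does not say anything about the behaviour of $\pi_B$ at 0.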
3. Krylov spaces

Set

$$K^{(m)} := \left(A^0 b, A b, \ldots, A^{m-1} b\right) \in \mathbb{R}^{p \times m} .$$

The columns of $K^{(m)}$ are called the Krylov sequence of A and b.

The space spanned by the columns of $K^{(m)}$ is called the Krylov space of A and b and denoted
by $\mathcal{K}^{(m)}$. We recall some basic facts on the dimension of the Krylov space that are needed in
the rest of the paper. Set

$$M := \{\lambda_i \mid t_i \neq 0\}$$

(the vector t is defined in (2)) and

$$m^* := |M| .$$

Lemma 3. We have

$$\dim \mathcal{K}^{(m^*)} = m^* .$$


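Proof. Suppose that

$$\sum_{j=0}^{m^*-1} \gamma_j A^j b = 0$$

for some $\gamma_0, \ldots, \gamma_{m^*-1} \in \mathbb{R}$. Using the eigendecomposition of A this equation is equivalent to

$$U \left( \sum_{j=0}^{m^*-1} \gamma_j \Lambda^j \right) t = 0 .$$

As U is an invertible matrix, this is equivalent to

$$\sum_{j=0}^{m^*-1} \gamma_j \lambda_i^j t_i = 0$$

for $i = 1, \ldots, p$. Hence, each element $\lambda_i \in M$ is a zero of the polynomial

$$\sum_{j=0}^{m^*-1} \gamma_j \lambda^j .$$

This is a polynomial of degree $\le m^* - 1$. As it has $m^* = |M|$ different zeroes, it must be trivial,
i.e. $\gamma_j = 0$. □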


Lemma 4. If $m > m^*$ we have $\dim \mathcal{K}^{(m)} = m^*$.

Proof. It is clear that $\dim \mathcal{K}^{(m)} \ge m^*$ as $\mathcal{K}^{(m^*)} \subset \mathcal{K}^{(m)}$. Assume that there is a set S of $m^* + 1$
linearly independent vectors in the Krylov sequence $K^{(m)}$. Set

$$I = \{ i \in \{1, \ldots, m\} \mid A^{i-1} b \in S \} .$$

Hence $|I| = m^* + 1$. The condition that S is linearly independent is equivalent to the following:
There is no nontrivial polynomial

$$g(\lambda) = \sum_{i \in I} \gamma_i \lambda^{i-1}$$

such that

(3) $g(\lambda_i) = 0$

for $\lambda_i \in M$. As the polynomial g has $|I| = m^* + 1$ coefficients and $|M| = m^*$, there is always a
nontrivial solution of equation (3). □
We sum up the two results:

Proposition 5. We have

$$\dim \mathcal{K}^{(m)} = \begin{cases} m & m \le m^* \\ m^* & m > m^* \end{cases}$$

In particular

(4) $\dim \mathcal{K}^{(m^*)} = \dim \mathcal{K}^{(m^*+1)} = \ldots = \dim \mathcal{K}^{(p)} = m^*$.
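Proposition 5 can be checked on a toy example. The sketch below assumes a diagonal matrix A (so that U is the identity and b = t) with hypothetical eigenvalues, and computes the dimension of the Krylov space by exact Gaussian elimination on the Krylov sequence; the helper names are not from the paper.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix given as a list of rows, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r

def krylov_dim(eigs, t, m):
    """Dimension of span{b, Ab, ..., A^(m-1) b} for A = diag(eigs) and b = t."""
    vectors = [[(lam ** j) * ti for lam, ti in zip(eigs, t)] for j in range(m)]
    return rank(vectors)

eigs = [3, 3, 2, 1]   # hypothetical eigenvalues, one of them repeated
t = [1, 1, 0, 2]      # t_3 = 0, so the eigenvalue 2 does not belong to M

# m* = |M| counts the distinct eigenvalues with a non-zero entry of t.
m_star = len({lam for lam, ti in zip(eigs, t) if ti != 0})
dims = [krylov_dim(eigs, t, m) for m in range(1, 5)]
```

In this example the dimension grows with m until it reaches m* and stays constant afterwards, as stated in equation (4).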


4. Partial Least Squares 

It is not our aim to give an introduction to the Partial Least Squares (PLS) method and 
refer to [^. We take a purely algebraic point of view as in The PLS estimator is the 

solution of the constrained minimization problem

$$\arg\min \|y - X\beta\| \quad \text{s.t. } \beta \in \mathcal{K}^{(m)} .$$

We call m the number of steps of PLS. It follows that any solution of this problem is of the
form $\hat\beta = K^{(m)} z$ where z is the solution of the unconstrained problem

$$\arg\min_{z} \|y - X K^{(m)} z\| .$$

Plugging this into the formula for the OLS estimator (cf. section 2) we get

Proposition 6 (@j). The PLS estimator obtained after m steps can be expressed in the following
way:

(5) $\hat\beta^{(m)}_{PLS} = K^{(m)} \left[ \left(K^{(m)}\right)^t A K^{(m)} \right]^- \left(K^{(m)}\right)^t b$.

It should be clear that we can replace the matrix $K^{(m)}$ in equation (5) by any matrix $W^{(m)}$
as long as its columns span the space $\mathcal{K}^{(m)}$. In fact, in the NIPALS algorithm (see 0), an
orthogonal basis of $\mathcal{K}^{(m)}$ is calculated with the help of the Gram-Schmidt procedure. Denote
by

(6) $W^{(m)} = (w_1, \ldots, w_m)$

this orthogonal basis of $\mathcal{K}^{(m)}$. Of course, this basis only exists if $\dim \mathcal{K}^{(m)} = m$, which might
not be true for all $m \le p$. The maximal number for which this holds is $m^*$ (see proposition 5).
Note however that

$$\mathcal{K}^{(m^*)} = \mathcal{K}^{(m^*+1)} = \ldots = \mathcal{K}^{(p)}$$

(see (4)) and the solution of the optimization problem does not change anymore. Hence for the
rest of the paper, we make the assumption that

(7) $\dim \mathcal{K}^{(m)} = m$.

Remark 7. We have

$$\hat\beta^{(m^*)}_{PLS} = \hat\beta_{OLS} .$$

Proof. We show that $\hat\beta_{OLS} \in \mathcal{K}^{(m^*)}$. By definition

$$\hat\beta_{OLS} = U \Lambda^- t = U \pi_A(\Lambda) t$$

with $\deg \pi_A = p^* - 1$ (recall that $p^*$ is the rank of A). On the other hand, any vector $v \in \mathbb{R}^p$
lies in $\mathcal{K}^{(p^*)}$ if and only if there is a polynomial g of degree $\le p^* - 1$ such that

$$v = g(A) b = g\left(U \Lambda U^t\right) U t = U g(\Lambda) t .$$

It follows that $\hat\beta_{OLS} \in \mathcal{K}^{(p^*)}$. As $p^* \ge m^*$ we have $\mathcal{K}^{(p^*)} = \mathcal{K}^{(m^*)}$. □


Set

$$T^{(m)} = \left(W^{(m)}\right)^t A W^{(m)} \in \mathbb{R}^{m \times m}$$

where $W^{(m)}$ is as defined in equation (6).

Proposition 8. The matrix $T^{(m)}$ is symmetric and positive semidefinite. Furthermore $T^{(m)}$
is tridiagonal, i.e. $t_{ij} = 0$ for $|i - j| \ge 2$.

Proof. The first two statements are obvious. Let $i \le j - 2$. As $w_i \in \mathcal{K}^{(i)}$, the vector $A w_i$ lies
in the subspace $\mathcal{K}^{(i+1)}$. As $j > i + 1$, the vector $w_j$ is orthogonal on $\mathcal{K}^{(i+1)}$, in other words

$$t_{ji} = \langle w_j, A w_i \rangle = 0 .$$

As $T^{(m)}$ is symmetric, we also have $t_{ij} = 0$ which proves the assertion. □

We will see in section 7 that the matrices $T^{(m)}$ and their eigenvalues determine the shrinkage
factors of the PLS estimator. To prove this, we list some properties of $T^{(m)}$ in the following
sections.

5. Tridiagonal matrices 


Definition 9. A symmetric tridiagonal matrix T is called unreduced if all subdiagonal entries
are non-zero, i.e. $t_{i,i+1} \neq 0$ for all i.

Theorem 10 (jHl). All eigenvalues of an unreduced matrix are distinct.

Set

$$T^{(m)} = \begin{pmatrix} a_1 & b_1 & 0 & \ldots & 0 \\ b_1 & a_2 & b_2 & \ldots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & \ldots & a_{m-1} & b_{m-1} \\ 0 & 0 & \ldots & b_{m-1} & a_m \end{pmatrix} .$$

Proposition 11. If $\dim \mathcal{K}^{(m)} = m$, the matrix $T^{(m)}$ is unreduced. More precisely $b_i > 0$ for
all $i \in \{1, \ldots, m - 1\}$.

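Proof. Set $v_i = A^{i-1} b$ and denote by $w_1, \ldots, w_m$ the basis obtained by Gram-Schmidt. Its
existence is guaranteed as we assume that $\dim \mathcal{K}^{(m)} = m$. For simplicity of notation, we
assume that the vectors $w_i$ are not normalized to have length 1. By definition

(8) $w_i = v_i - \sum_{k=1}^{i-1} \frac{\langle v_i, w_k \rangle}{\langle w_k, w_k \rangle} \cdot w_k$.

As the vectors $w_i$ are pairwise orthogonal, it follows that

$$\langle w_i, v_i \rangle = \langle w_i, w_i \rangle > 0 .$$

We conclude that

$$b_{i-1} = \langle w_i, A w_{i-1} \rangle = \left\langle w_i, A \left( v_{i-1} - \sum_{k=1}^{i-2} \frac{\langle v_{i-1}, w_k \rangle}{\langle w_k, w_k \rangle} w_k \right) \right\rangle$$

$$= \langle w_i, v_i \rangle - \sum_{k=1}^{i-2} \frac{\langle v_{i-1}, w_k \rangle}{\langle w_k, w_k \rangle} \langle w_i, A w_k \rangle = \langle w_i, v_i \rangle > 0 ,$$

where we used $A v_{i-1} = v_i$ and the fact that $\langle w_i, A w_k \rangle = 0$ for $k \le i - 2$ (Proposition 8). □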


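Note that the matrix $T^{(m-1)}$ is obtained from $T^{(m)}$ by deleting the last column and row of
$T^{(m)}$. It follows that we can give a recursive formula for the characteristic polynomials

$$\chi^{(m)} := \chi_{T^{(m)}}$$

of $T^{(m)}$. We have

(9) $\chi^{(m)}(\lambda) = (a_m - \lambda) \cdot \chi^{(m-1)}(\lambda) - b_{m-1}^2 \chi^{(m-2)}(\lambda)$

and $\chi^{(1)}(\lambda) = a_1 - \lambda$.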
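The recursion (9) is easy to evaluate numerically. The following sketch assumes the usual convention $\chi^{(0)} := 1$, which is needed to start the recursion at m = 2 and is not stated explicitly in the text; the function name and the example entries are hypothetical.

```python
def char_poly_values(a, b, lam):
    """Return [chi^(0)(lam), ..., chi^(m)(lam)] for the symmetric tridiagonal
    matrix with diagonal a[0..m-1] and off-diagonal b[0..m-2], using the
    three-term recursion (9). chi^(0) := 1 is an assumed starting convention."""
    chi = [1.0, a[0] - lam]
    for k in range(1, len(a)):
        chi.append((a[k] - lam) * chi[k] - b[k - 1] ** 2 * chi[k - 1])
    return chi

# 2 x 2 example with diagonal (1, 1) and off-diagonal 1: eigenvalues 0 and 2.
chi_at_0 = char_poly_values([1.0, 1.0], [1.0], 0.0)[-1]
chi_at_2 = char_poly_values([1.0, 1.0], [1.0], 2.0)[-1]

# Constant diagonal a_i = c: c is a zero of chi^(m) for every odd m.
c = 3.0
chi_const = char_poly_values([c] * 5, [1.0, 2.0, 0.5, 1.5], c)
```

A zero of $\chi^{(m)}$ is an eigenvalue of $T^{(m)}$, so the second example shows a tridiagonal matrix whose leading submatrices of odd order all share the eigenvalue c.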


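We want to deduce properties of the eigenvalues of $T^{(m)}$ and A and explore their relationship.
Denote the eigenvalues of $T^{(m)}$ by

(10) $\mu_1^{(m)} \ge \ldots \ge \mu_m^{(m)} \ge 0$.

Remark 12. All eigenvalues of $T^{(m^*)}$ are eigenvalues of A.

Proof. First note that

$$A|_{\mathcal{K}^{(m^*)}} : \mathcal{K}^{(m^*)} \to \mathcal{K}^{(m^*)} .$$

As the columns of the matrix $W^{(m^*)}$ form an orthonormal basis of $\mathcal{K}^{(m^*)}$,

$$T^{(m^*)} = \left(W^{(m^*)}\right)^t A W^{(m^*)}$$

is the matrix that represents $A|_{\mathcal{K}^{(m^*)}}$ with respect to this basis. As any eigenvalue of $A|_{\mathcal{K}^{(m^*)}}$ is
obviously an eigenvalue of A, the proof is complete. □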


The following theorem is a special form of the Cauchy Interlace Theorem. In this version,
we use a general result from [H] and exploit the tridiagonal structure of $T^{(m)}$.

Theorem 13. Each interval

$$\left[ \mu^{(m)}_{m-j}, \mu^{(m)}_{m-(j+1)} \right]$$

($j = 0, \ldots, m - 2$) contains a different eigenvalue of $T^{(m+k)}$ ($k \ge 1$). In addition, there is a
different eigenvalue of $T^{(m+k)}$ outside the open interval $\left( \mu^{(m)}_m, \mu^{(m)}_1 \right)$.

This theorem ensures in particular that there is a different eigenvalue of A in the interval
$\left[ \mu^{(m)}_k, \mu^{(m)}_{k-1} \right]$.

Theorem 13 holds independently of assumption (7).


Proof. By definition, for $k \ge 1$

$$T^{(m+k)} = \begin{pmatrix} T^{(m-1)} & t^t & 0 \\ t & a_m & * \\ 0 & * & * \end{pmatrix} .$$

Here $t = (0, \ldots, 0, b_{m-1})$, so

$$T^{(m)} = \begin{pmatrix} T^{(m-1)} & t^t \\ t & a_m \end{pmatrix} .$$

An application of Theorem 10.4.1 in jH] gives the desired result. □


Lemma 14. If $T^{(m)}$ is unreduced, the eigenvalues of $T^{(m)}$ and the eigenvalues of $T^{(m-1)}$ are
distinct.

Proof. Suppose the two matrices have a common eigenvalue $\lambda$. It follows from (9) and the fact
that $T^{(m)}$ is unreduced that $\lambda$ is an eigenvalue of $T^{(m-2)}$. Repeating this, we deduce that $a_1$ is
an eigenvalue of $T^{(2)}$, a contradiction, as

$$0 = \chi^{(2)}(a_1) = -b_1^2 < 0 .$$

□


Remark 15. In general it is not true that $T^{(m)}$ and a submatrix $T^{(k)}$ have distinct eigenvalues.
Consider the case where $a_i = c$ for all i. Using equation (9) we conclude that c is an eigenvalue
for all submatrices $T^{(m)}$ with m odd.

Proposition 16. If $\dim \mathcal{K}^{(m)} = m$, we have $\det T^{(m-1)} \neq 0$.

Proof. $T^{(m)}$ is positive semidefinite, hence all eigenvalues of $T^{(m)}$ are $\ge 0$. In other words,
$\det T^{(m-1)} \neq 0$ if and only if its smallest eigenvalue $\mu^{(m-1)}_{m-1}$ is $> 0$. Using Theorem 13 we
have

$$\mu^{(m-1)}_{m-1} \ge \mu^{(m)}_m \ge 0 .$$

As $\dim \mathcal{K}^{(m)} = m$, the matrix $T^{(m)}$ is unreduced, which implies that $T^{(m)}$ and $T^{(m-1)}$ have
no common eigenvalues (see Lemma 14). We can therefore replace the first $\ge$ by $>$, i.e. the smallest
eigenvalue of $T^{(m-1)}$ is $> 0$. □


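In general, it is not true that $\det T^{(m)} \neq 0$. An easy example is

$$A = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \end{pmatrix} .$$

We have

$$K^{(2)}(A, b) = (b, Ab) = \begin{pmatrix} 1 & 2 \\ 1 & 0 \end{pmatrix} ,$$

i.e. $\dim \mathcal{K}^{(2)} = 2$. On the other hand

$$\det T^{(2)} = \det \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} = 0 .$$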
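For completeness, the entries of $T^{(2)}$ in this example can be worked out by hand. The intermediate steps below are our own computation and are not spelled out in the original text; they assume the orthonormal basis obtained by applying Gram-Schmidt to b and Ab.

```latex
\begin{align*}
w_1 &= \tfrac{1}{\sqrt{2}} (1, 1)^t, \qquad
Ab - \langle Ab, w_1 \rangle w_1 = (2, 0)^t - (1, 1)^t = (1, -1)^t, \qquad
w_2 = \tfrac{1}{\sqrt{2}} (1, -1)^t, \\
A w_1 &= \tfrac{1}{\sqrt{2}} (2, 0)^t, \qquad A w_2 = \tfrac{1}{\sqrt{2}} (2, 0)^t, \\
T^{(2)} &= \begin{pmatrix} \langle w_1, A w_1 \rangle & \langle w_1, A w_2 \rangle \\
                          \langle w_2, A w_1 \rangle & \langle w_2, A w_2 \rangle \end{pmatrix}
        = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.
\end{align*}
```

Note that $\det T^{(1)} = 1 \neq 0$, in agreement with Proposition 16.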


It is well known that the matrices $T^{(m)}$ are closely related to the so-called Rayleigh-Ritz
procedure, a method that is used to approximate eigenvalues. For details consult e.g. (H].


6. What is shrinkage? 

We have presented two estimators for the regression parameter $\beta$ - OLS and PLS - which
also define estimators for $X\beta$ via

$$\hat y_{\bullet} = X \hat\beta_{\bullet} .$$

One possibility to evaluate the quality of an estimator is to determine its Mean Squared Error
(MSE). In general, the MSE of an estimator $\hat\theta$ for a vector-valued parameter $\theta$ is defined as

$$MSE(\hat\theta) = E\left[ \mathrm{trace}\left( (\hat\theta - \theta)(\hat\theta - \theta)^t \right) \right]
= \left( E[\hat\theta] - \theta \right)^t \left( E[\hat\theta] - \theta \right) + E\left[ \left( \hat\theta - E[\hat\theta] \right)^t \left( \hat\theta - E[\hat\theta] \right) \right] .$$

This is the well-known bias-variance decomposition of the MSE. The first part is the squared
bias and the second part is the variance term.

We start by investigating the class of linear estimators, i.e. estimators that are of the form
$\hat\theta = Sy$ for some matrix S that does not depend on y. The OLS estimators are linear:

$$\hat\beta_{OLS} = (X^tX)^- X^t y := S_1 y$$
$$\hat y_{OLS} = X (X^tX)^- X^t y := S_2 y .$$

$S_2$ is the projection $P_{L(X)}$ onto the space that is spanned by the columns of X.


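Recall the regression model (1).

Proposition 17. Let $\hat\theta = Sy$ be a linear estimator. We have

$$E[\hat\theta] = S X \beta , \qquad \mathrm{var}(\hat\theta) = \sigma^2 \, \mathrm{tr}(S S^t) .$$

The estimator $\hat y_{OLS}$ is unbiased as

$$E[\hat y_{OLS}] = S_2 X \beta = P_{L(X)} X \beta = X \beta .$$

The estimator $\hat\beta_{OLS}$ is only unbiased if $\beta \in \mathrm{range}\left( (X^tX)^- \right)$:

$$E[\hat\beta_{OLS}] = E\left[ (X^tX)^- X^t y \right] = (X^tX)^- X^t E[y] = (X^tX)^- X^tX \beta = \beta .$$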


Let us now have a closer look at the variance term.
For $\hat\beta_{OLS}$ we have

$$S_1 S_1^t = (X^tX)^- X^tX (X^tX)^- = (X^tX)^- = U \Lambda^- U^t ,$$

hence

(11) $\mathrm{var}(\hat\beta_{OLS}) = \sigma^2 \cdot \sum_{i=1}^{p^*} \frac{1}{\lambda_i}$.

Next note that $S_2$ is the operator that projects on the space spanned by the columns of X. It
follows that $\mathrm{tr}(S_2 S_2^t) = \mathrm{rk}(X) = p^*$ and that

$$\mathrm{var}(\hat y_{OLS}) = \sigma^2 \cdot p^* .$$
We conclude that the MSE of the estimator $\hat\beta_{OLS}$ depends on the eigenvalues $\lambda_1, \ldots, \lambda_{p^*}$ of
$A = X^tX$. Small eigenvalues of A correspond to directions in X that have very low variance.
Equation (11) shows that if some eigenvalues are small, the variance of $\hat\beta_{OLS}$ is very high, which
leads to a high MSE.

One possibility to (hopefully) decrease the MSE is to modify the OLS estimator by shrinking 
the directions of the OLS estimator that are responsible for a high variance. This of course 
introduces bias. We shrink the OLS estimator in the hope that the increase in bias is small 
compared to the decrease in variance. 

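In general, a shrinkage estimator for $\beta$ is of the form

$$\hat\beta_{shr} = \sum_{i=1}^{p^*} f(\lambda_i) \cdot z_i ,$$

where $f$ is some real-valued function. The values $f(\lambda_i)$ are called shrinkage factors.
Examples are

• Principal Component Regression

$$f(\lambda_i) = \begin{cases} 1 & \text{ith principal component included} \\ 0 & \text{otherwise} \end{cases}$$

and

• Ridge Regression

$$f(\lambda_i) = \frac{\lambda_i}{\lambda_i + \lambda}$$

where $\lambda > 0$ is the Ridge parameter.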
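The two examples can be written down directly. In the sketch below the function names and the eigenvalues are hypothetical, and for Principal Component Regression we assume that the included components are the ones with the largest eigenvalues.

```python
def pcr_factors(eigenvalues, n_components):
    """Shrinkage factors of PCR, assuming the eigenvalues are sorted in
    decreasing order and the first n_components directions are included."""
    return [1.0 if i < n_components else 0.0 for i in range(len(eigenvalues))]

def ridge_factors(eigenvalues, ridge_parameter):
    """Shrinkage factors lambda_i / (lambda_i + ridge_parameter) of Ridge Regression."""
    return [lam / (lam + ridge_parameter) for lam in eigenvalues]

eigenvalues = [4.0, 1.0, 0.25]          # hypothetical eigenvalues of A
pcr = pcr_factors(eigenvalues, 2)       # directions 1 and 2 kept, direction 3 dropped
ridge = ridge_factors(eigenvalues, 1.0) # every direction is shrunk, small ones the most
```

Both choices are deterministic functions of the eigenvalues, and all factors lie in [0, 1].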

We will see in section 7 that PLS is a shrinkage estimator as well. It will turn out that the
shrinkage behavior of PLS regression is rather complicated.

Let us investigate in which way the MSE of the estimator is influenced by the shrinkage
factors. If the shrinkage estimators are linear, i.e. the shrinkage factors do not depend on y,
this is an easy task. Let us first write the shrinkage estimator in matrix notation. We have

$$\hat\beta_{shr} = S_{shr,1} \, y = U \Sigma^- D_{shr} V^t y .$$

The diagonal matrix $D_{shr}$ has entries $f(\lambda_i)$. The shrinkage estimator for y is

$$\hat y_{shr} = S_{shr,2} \, y = V \Sigma \Sigma^- D_{shr} V^t y .$$


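We calculate the variance of these estimators.

$$\mathrm{tr}\left( S_{shr,1} S_{shr,1}^t \right) = \mathrm{tr}\left( U \Sigma^- D_{shr} \Sigma^- D_{shr} U^t \right)
= \mathrm{tr}\left( \Sigma^- D_{shr} \Sigma^- D_{shr} \right)
= \sum_{i=1}^{p^*} \frac{\left( f(\lambda_i) \right)^2}{\lambda_i}$$

and

$$\mathrm{tr}\left( S_{shr,2} S_{shr,2}^t \right) = \mathrm{tr}\left( V \Sigma \Sigma^- D_{shr} \Sigma \Sigma^- D_{shr} V^t \right)
= \mathrm{tr}\left( \Sigma \Sigma^- D_{shr} \Sigma \Sigma^- D_{shr} \right)
= \sum_{i=1}^{p^*} \left( f(\lambda_i) \right)^2 .$$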

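Next, we calculate the bias of the two shrinkage estimators. We have

$$E\left[ S_{shr,1} y \right] = S_{shr,1} X \beta = U \Sigma^- D_{shr} \Sigma U^t \beta .$$

It follows that

$$\mathrm{bias}^2(\hat\beta_{shr}) = \left( E[S_{shr,1} y] - \beta \right)^t \left( E[S_{shr,1} y] - \beta \right)$$
$$= \left( U^t \beta \right)^t \left( \Sigma^- D_{shr} \Sigma - \mathrm{Id} \right)^t \left( \Sigma^- D_{shr} \Sigma - \mathrm{Id} \right) \left( U^t \beta \right)$$
$$= \sum_{i=1}^{p^*} \left( f(\lambda_i) - 1 \right)^2 \left( u_i^t \beta \right)^2 .$$

Replacing $S_{shr,1}$ by $S_{shr,2}$ it is as easy to show that

$$\mathrm{bias}^2(\hat y_{shr}) = \sum_{i=1}^{p^*} \lambda_i \left( f(\lambda_i) - 1 \right)^2 \left( u_i^t \beta \right)^2 .$$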


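Theorem 18. For the shrinkage estimators $\hat\beta_{shr}$ and $\hat y_{shr}$ defined above we have

$$MSE(\hat\beta_{shr}) = \sum_{i=1}^{p^*} \left( f(\lambda_i) - 1 \right)^2 \left( u_i^t \beta \right)^2 + \sigma^2 \sum_{i=1}^{p^*} \frac{\left( f(\lambda_i) \right)^2}{\lambda_i}$$

$$MSE(\hat y_{shr}) = \sum_{i=1}^{p^*} \lambda_i \left( f(\lambda_i) - 1 \right)^2 \left( u_i^t \beta \right)^2 + \sigma^2 \sum_{i=1}^{p^*} \left( f(\lambda_i) \right)^2 .$$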

If the shrinkage factors are deterministic, i.e. they do not depend on y, any value $f(\lambda_i) \neq 1$
increases the bias. Values $|f(\lambda_i)| < 1$ decrease the variance, whereas values $|f(\lambda_i)| > 1$ increase
the variance. Hence an absolute value > 1 is always undesirable. The situation is completely
different for stochastic shrinkage factors. We will discuss this in the following section.
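The trade-off described by Theorem 18 can be made concrete with a few lines of code. The sketch below evaluates the formula for $MSE(\hat\beta_{shr})$ for deterministic shrinkage factors; the eigenvalues, the coordinates $u_i^t\beta$ and the noise variance are hypothetical numbers chosen so that one direction has a very small eigenvalue.

```python
def mse_beta(factors, eigenvalues, coords, sigma2):
    """MSE of beta_shr as in Theorem 18; coords[i] stands for u_i^t beta."""
    bias2 = sum((f - 1.0) ** 2 * c ** 2 for f, c in zip(factors, coords))
    var = sigma2 * sum(f ** 2 / lam for f, lam in zip(factors, eigenvalues))
    return bias2 + var

eigenvalues = [4.0, 1.0, 0.01]   # the third direction has very low variance
coords = [1.0, 1.0, 1.0]         # hypothetical values of u_i^t beta
sigma2 = 1.0

mse_ols = mse_beta([1.0, 1.0, 1.0], eigenvalues, coords, sigma2)   # no shrinkage
mse_pcr = mse_beta([1.0, 1.0, 0.0], eigenvalues, coords, sigma2)   # drop third direction
mse_over = mse_beta([1.0, 1.0, 1.5], eigenvalues, coords, sigma2)  # factor > 1
```

In this example, dropping the low-variance direction trades a small bias for a large reduction in variance, while a deterministic factor larger than 1 increases both terms.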

Note that there is a different notion of shrinkage, namely that the $\ell_2$-norm of an estimator
is smaller than the $\ell_2$-norm of the OLS estimator. Why is this a desirable property? Let us
again consider the case of linear estimators. Set $\hat\beta_i = S_i y$ for $i = 1, 2$. We have

$$\|\hat\beta_i\|_2^2 = y^t S_i^t S_i y .$$

The property that for all $y \in \mathbb{R}^n$

$$\|\hat\beta_1\|_2 \le \|\hat\beta_2\|_2$$

is equivalent to the condition that

$$S_1^t S_1 - S_2^t S_2$$

is negative semidefinite. The trace of negative semidefinite matrices is $\le 0$. Furthermore
$\mathrm{trace}(S_i^t S_i) = \mathrm{trace}(S_i S_i^t)$, so we conclude that

$$\mathrm{var}(\hat\beta_1) \le \mathrm{var}(\hat\beta_2) .$$

It is known (see jH]) that

$$\|\hat\beta^{(1)}_{PLS}\|_2 \le \|\hat\beta^{(2)}_{PLS}\|_2 \le \ldots \le \|\hat\beta^{(m^*)}_{PLS}\|_2 = \|\hat\beta_{OLS}\|_2 .$$
7. The shrinkage factors of PLS

In this section, we give a simpler and clearer proof of the shape of the shrinkage factors of
PLS. Basically, we combine the results of Q] and [HI-. It turns out that some of the factors
$f^{(m)}(\lambda_i)$ are greater than 1. We try to explain why these "peculiar shrinkage properties" [1] do
not necessarily imply that the MSE of the PLS estimator is increased.

Denote by $\pi^{(m)}$ the polynomial associated to $T^{(m)}$ that was defined in proposition 1, i.e.

$$\left( T^{(m)} \right)^- = \pi^{(m)}\left( T^{(m)} \right) .$$

Recall that the eigenvalues of $T^{(m)}$ are denoted by $\mu_i^{(m)}$. It follows that

(12) $f^{(m)}(\lambda) := \lambda \cdot \pi^{(m)}(\lambda) = 1 - \prod_{i=1}^{m} \left( 1 - \frac{\lambda}{\mu_i^{(m)}} \right)$.

By definition of PLS, $\hat\beta^{(m)}_{PLS} \in \mathcal{K}^{(m)}$, hence there is a polynomial $\pi$ of degree $\le m - 1$ with

$$\hat\beta^{(m)}_{PLS} = \pi(A) b .$$

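Proposition 19 (jHI). Suppose that $\dim \mathcal{K}^{(m)} = m$. We have

$$\hat\beta^{(m)}_{PLS} = \pi^{(m)}(A) \cdot b .$$

Proof (^). By proposition 1

$$\left( T^{(m)} \right)^- = \pi^{(m)}\left( \left( W^{(m)} \right)^t A W^{(m)} \right) .$$

We plug this into equation (5) and obtain

$$\hat\beta^{(m)}_{PLS} = W^{(m)} \, \pi^{(m)}\left( \left( W^{(m)} \right)^t A W^{(m)} \right) \left( W^{(m)} \right)^t b .$$

Recall that the columns of $W^{(m)}$ form an orthonormal basis of $\mathcal{K}^{(m)}(A, b)$. It follows that
$W^{(m)} \left( W^{(m)} \right)^t$ is the operator that projects on the space $\mathcal{K}^{(m)}$. In particular

$$W^{(m)} \left( W^{(m)} \right)^t A^j b = A^j b$$

for $j = 0, \ldots, m - 1$. This implies that

$$\hat\beta^{(m)}_{PLS} = \pi^{(m)}(A) \cdot b .$$

□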

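Corollary 20 ((HI). Suppose that $\dim \mathcal{K}^{(m)} = m$. If we denote by $z_i$ the component of $\hat\beta_{OLS}$
along the ith eigenvector of A then

$$\hat\beta^{(m)}_{PLS} = \sum_{i=1}^{p^*} f^{(m)}(\lambda_i) \cdot z_i ,$$

where $f^{(m)}$ is the polynomial defined in (12).

Proof ([2]). This follows immediately from the proposition above. We have

$$\hat\beta^{(m)}_{PLS} = U \pi^{(m)}(\Lambda) \Sigma V^t y
= \sum_{i=1}^{p^*} \pi^{(m)}(\lambda_i) \sqrt{\lambda_i} \left( v_i^t y \right) u_i
= \sum_{i=1}^{p^*} f^{(m)}(\lambda_i) \frac{v_i^t y}{\sqrt{\lambda_i}} u_i
= \sum_{i=1}^{p^*} f^{(m)}(\lambda_i) \cdot z_i .$$

□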
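Corollary 20 can be verified numerically. The sketch below is an illustration on simulated data, not the implementation used for the experiments in this paper: it assumes NumPy, a full-rank A (so that inverses can be used instead of Moore-Penrose inverses), and a QR decomposition of the Krylov matrix as a stand-in for the Gram-Schmidt procedure. All variable names and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 30, 5, 2                      # hypothetical sizes; m is the number of PLS steps
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                     # centre the columns of X
y = rng.standard_normal(n)
y -= y.mean()

A, b = X.T @ X, X.T @ y

# Orthonormal basis W of span{b, Ab, ..., A^(m-1) b}; QR plays the role of Gram-Schmidt.
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
W, _ = np.linalg.qr(K)

T = W.T @ A @ W                         # the matrix T^(m)
beta_pls = W @ np.linalg.solve(T, W.T @ b)    # equation (5) with W in place of K

lam, U = np.linalg.eigh(A)              # eigenvalues of A in ascending order
mu = np.linalg.eigvalsh(T)              # eigenvalues of T^(m)

# Shrinkage factors from equation (12): f(lam_i) = 1 - prod_j (1 - lam_i / mu_j).
f = 1.0 - np.prod(1.0 - lam[:, None] / mu[None, :], axis=1)

beta_ols = np.linalg.solve(A, b)
z = U * (U.T @ beta_ols)                # column i is z_i = (u_i^t beta_ols) u_i
beta_from_factors = z @ f               # sum_i f(lam_i) z_i
```

Up to rounding errors, `beta_from_factors` agrees with `beta_pls`. Printing `f` for such a data set typically shows factors on both sides of 1, which is the behavior analysed in the remainder of this section.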


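We now show that some of the shrinkage factors of PLS are $\neq 1$.

Theorem 21 ([1]). For each $m \le m^* - 1$, we can decompose the interval $[\lambda_p, \lambda_1]$ into $m + 1$
disjoint intervals¹

$$I_1 \le I_2 \le \ldots \le I_{m+1}$$

such that

$$f^{(m)}(\lambda_i) \begin{cases} \le 1 & \lambda_i \in I_j \text{ and } j \text{ odd} \\ \ge 1 & \lambda_i \in I_j \text{ and } j \text{ even} \end{cases}$$

Proof. Set $g^{(m)} = 1 - f^{(m)}$. It follows from equation (12) that the zeros of $g^{(m)}$ are $\mu_1^{(m)}, \ldots, \mu_m^{(m)}$.
As $T^{(m)}$ is unreduced, all eigenvalues are distinct. Set $\mu_0^{(m)} = \lambda_1$ and $\mu_{m+1}^{(m)} = \lambda_p$. Define
$I_{j+1} = \left] \mu^{(m)}_{m-j+1}, \mu^{(m)}_{m-j} \right]$ for $j = 0, \ldots, m$. By definition, $g^{(m)}(0) = 1$. Hence $g^{(m)}$ is non-negative on
the intervals $I_j$ if j is odd and $g^{(m)}$ is non-positive on the intervals $I_j$ if j is even. It follows
from Theorem 13 that all intervals $I_j$ contain at least one eigenvalue $\lambda_i$ of A. □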


In general it is not true that $f^{(m)}(\lambda_i) \neq 1$ for all $\lambda_i$ and $m = 1, \ldots, m^*$. Using the example
in remark 15 and the fact that

$$f^{(m)}(\lambda_i) = 1$$

is equivalent to the condition that $\lambda_i$ is an eigenvalue of $T^{(m)}$, it is easy to construct a
counterexample. Using some of the results of section 5 we can however deduce that some factors
are indeed $\neq 1$. As all eigenvalues of $T^{(m^*-1)}$ and $T^{(m^*)}$ are distinct (c.f. Lemma 14), we
see that $f^{(m^*-1)}(\lambda_i) \neq 1$ for all i. In particular

$$f^{(m^*-1)}(\lambda_1) \begin{cases} > 1 & m^* \text{ even} \\ < 1 & m^* \text{ odd} \end{cases}$$

More generally, using Lemma 14 we conclude that $f^{(m-1)}(\lambda_i) = 1$ and $f^{(m)}(\lambda_i) = 1$ is not possible.
In practice - i.e. calculated on a data set - the factors seem to be $\neq 1$ all of the time.
Furthermore

$$0 \le f^{(m)}(\lambda_p) < 1 .$$

To prove this, we set $g^{(m)} = 1 - f^{(m)}$. We have by definition $g^{(m)}(0) = 1$. Furthermore, the
smallest positive zero of $g^{(m)}$ is $\mu_m^{(m)}$ and it follows from Theorem 13 and proposition 16 that
$\lambda_p \le \mu_m^{(m)}$. Hence $g^{(m)}(\lambda_p) \in \, ]0, 1]$.

¹We say that $I_j \le I_k$ if $\sup I_j \le \inf I_k$.
Using Theorem 13, more precisely

$$\lambda_p \le \mu_i^{(m)} \le \lambda_i ,$$

it is possible to bound the terms

$$1 - \frac{\lambda_i}{\mu_j^{(m)}} .$$

From this we can derive bounds on the shrinkage factors. We will not pursue this further,
readers who are interested in the bounds should consult [H]. Instead, we have a closer look at
the MSE of the PLS estimator.

In section 6 we showed that a value $|f^{(m)}(\lambda_i)| > 1$ is not desirable, as the variance of the
estimator increases. Note however, that in the case of PLS, the factors $f^{(m)}(\lambda_i)$ are stochastic;
they depend on y - in a nonlinear way. For $f^{(m)}(\lambda_i) \cdot z_i$ we have the following situation: If we set
$Z = f^{(m)}(\lambda_i)$ and let $W$ denote the corresponding component of the OLS estimator, we have to compare

$$\mathrm{var}(Z \cdot W) \text{ to } \mathrm{var}(W) .$$

Note that the RHS is not necessarily smaller than the LHS, even if $P(Z > 1) = 1$. An easy
counterexample is $Z = 1/W$ - the LHS is 0.


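Among others, [2] proposed to bound the shrinkage factors of the PLS estimator in the
following way. Set

$$\tilde f^{(m)}(\lambda_i) = \begin{cases} +1 & f^{(m)}(\lambda_i) > +1 \\ -1 & f^{(m)}(\lambda_i) < -1 \\ f^{(m)}(\lambda_i) & \text{otherwise} \end{cases}$$

and define a new estimator:

(13) $\hat\beta^{(m)}_{BOUND} := \sum_{i=1}^{p^*} \tilde f^{(m)}(\lambda_i) \cdot z_i$.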
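The truncation used by BOUND is a simple clipping operation. A minimal sketch, with a hypothetical function name and hypothetical factor values:

```python
def bound_factor(f):
    """Clip a shrinkage factor to the interval [-1, 1], as in the definition above."""
    return max(-1.0, min(1.0, f))

factors = [0.3, 1.4, -1.2, 0.97]                  # hypothetical PLS shrinkage factors
bounded = [bound_factor(f) for f in factors]      # factors used by BOUND
```

Only the factors whose absolute value exceeds 1 are changed; all other factors coincide with those of PLS.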



If the shrinkage factors are numbers, this will improve the MSE (cf. section 6). But in the
case of stochastic shrinkage factors, the situation is completely unclear. Consider again the
example $Z = 1/W$. Set

$$\tilde Z = \begin{cases} +1 & Z > 1 \\ -1 & Z < -1 \\ Z & \text{otherwise} \end{cases}$$

In this case

$$0 = \mathrm{var}(Z \cdot W) < \mathrm{var}(\tilde Z \cdot W)$$

so it is not clear whether the modified estimator BOUND leads to a lower MSE, which was
conjectured in e.g. [2].
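The artificial example can be reproduced with a finite sample standing in for the distribution of W. The values of W below are hypothetical and chosen in (0, 1) so that Z = 1/W is larger than 1 for every sample point; exact fractions are used to avoid rounding effects.

```python
from fractions import Fraction

def variance(values):
    """Population variance of a finite list of values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def bound(z):
    """Clip z to the interval [-1, 1]."""
    return max(Fraction(-1), min(Fraction(1), z))

W = [Fraction(k, 10) for k in range(1, 10)]   # hypothetical sample of W in (0, 1)
Z = [1 / w for w in W]                        # stochastic factor Z = 1/W, always > 1

var_unbounded = variance([z * w for z, w in zip(Z, W)])          # Z * W is constant
var_bounded = variance([bound(z) * w for z, w in zip(Z, W)])     # bounded factor is 1
var_w = variance(W)
```

Here the unbounded product has zero variance, whereas the bounded product has the same variance as W itself, so bounding the factor increases the variance in this example.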

The above example (involving W and Z) is of course purely artificial. It is not clear whether
the shrinkage factors behave this way. It is hard if not infeasible to derive statistical properties
of the PLS estimator or its shrinkage factors, as they depend on y in a complicated, nonlinear
way. As an alternative, we compare the two different estimators on different data.

8. Experiments 

In this section, we explore the difference between the methods PLS and BOUND. We
investigate three artificial datasets and one real world example. In all examples, we rescale X and
y to have zero mean and unit variance.


Let us start with the artificial datasets. Of course, artificial datasets do not reflect many
real world situations, but we have the advantage that we know the true regression coefficient $\beta$
and that we have an unlimited amount of examples at hand. We can estimate the MSE of any
of the four estimators: For $k = 1, \ldots, K$ we generate a sample y and calculate the estimator
$\hat\theta_k$. We define

$$\widehat{MSE}(\hat\theta) = \frac{1}{K} \sum_{k=1}^{K} \left( \hat\theta_k - \theta \right)^t \left( \hat\theta_k - \theta \right) .$$

For all examples, we choose K = 200.
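The estimate of the MSE is the average squared Euclidean distance between the K estimates and the true parameter. A minimal sketch, with a hypothetical function name and toy numbers in place of the simulated estimators:

```python
def estimated_mse(estimates, theta):
    """Average squared Euclidean distance between the estimates and theta."""
    K = len(estimates)
    return sum(
        sum((e - t) ** 2 for e, t in zip(estimate, theta)) for estimate in estimates
    ) / K

theta = [1.0, 2.0]                                  # hypothetical true parameter
estimates = [[1.0, 2.0], [2.0, 2.0], [1.0, 0.0]]    # hypothetical K = 3 estimates
mse_hat = estimated_mse(estimates, theta)
```

In the experiments the same quantity is computed with K = 200 estimates per method and number of steps.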


First example. In our first example we generate n = 30 examples in the following way:
The input data is the realisation of a p = 10 dimensional normally distributed variable with
expectation $0 \in \mathbb{R}^p$ and covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ defined as


Ztij — 



i=j 

j 


The regression coefficient $\beta$ is a random permutation of $(0, 0, 0, 0, 0, z_1, \ldots, z_5)$ with
$z_i \sim N(2, 2^2)$.


Next we determine the variance of the error term. We do this by considering several
signal-to-noise-ratios (stnr). This quantity is defined as

$$stnr = \frac{\mathrm{var}(X\beta)}{\mathrm{var}(\varepsilon)} .$$

We set stnr = 1, 4, 16 and determine the corresponding value of $\sigma$. We generate K = 200
samples y and calculate the four estimators.
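Solving the definition of the signal-to-noise-ratio for the noise variance gives $\sigma^2 = \mathrm{var}(X\beta) / stnr$. The sketch below applies this to the three ratios; the value used for $\mathrm{var}(X\beta)$ is hypothetical, and how this variance is computed from the simulated inputs is not specified in the text.

```python
def noise_variance(signal_variance, stnr):
    """sigma^2 = var(X beta) / stnr, obtained by solving the definition of stnr."""
    return signal_variance / stnr

signal_variance = 8.0   # hypothetical value of var(X beta)

# Standard deviation sigma of the error term for each signal-to-noise-ratio.
sigmas = {s: noise_variance(signal_variance, s) ** 0.5 for s in (1, 4, 16)}
```

A larger signal-to-noise-ratio corresponds to a smaller $\sigma$ for the same signal variance.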


The following figures show the estimated MSE for $\beta$ and $X\beta$ respectively. The solid lines
with the •'s correspond to PLS, the lines with the +'s correspond to BOUND.

Figure 1. First example: Comparison of PLS and BOUND (stnr = 1)

Figure 2. First example: Comparison of PLS and BOUND (stnr = 4)

Figure 3. First example: Comparison of PLS and BOUND (stnr = 16)

We see that BOUND is better in all cases, although the improvement is not dramatic. We
should remark that both methods pick the same (optimal) number of steps most of the time.
The difference between the two methods is especially tiny (but non-zero) in the first step. We
do not have an explanation for this phenomenon. The MSE is the same for the last step m = 10
as in this case

$$\hat\beta^{(10)}_{PLS} = \hat\beta^{(10)}_{BOUND} = \hat\beta_{OLS} .$$

Second example. In this example, we generate n = 40 examples. The input data is the
realisation of a p = 20 dimensional random variable with distribution $N(0, \Sigma)$. The covariance
matrix is defined as in the first example (with p = 10 replaced by p = 20). Again, the
coefficients of $\beta$ are a random permutation of $(0, \ldots, 0, z_1, \ldots, z_{10})$ with $z_i \sim N(2, 2^2)$. We consider
the signal-to-noise-ratios 1, 4, 16.

Figure 4. Second example: Comparison of PLS and BOUND (stnr = 1)

Figure 5. Second example: Comparison of PLS and BOUND (stnr = 4)

Figure 6. Second example: Comparison of PLS and BOUND (stnr = 16)

The results are qualitatively the same as those from the first example. BOUND is better all
of the time, and the optimal number of steps is the same for both methods.

Third example. The input data is generated as in the second example, in particular, we have
p = 20. This time, we only generate n = 10 examples. The coefficients of the regression vector
$\beta$ are realizations of a $N(2, 2^2)$ distributed random variable. We investigate the
signal-to-noise-ratios 1, 4, 16. As we have more variables than examples, we do not investigate estimators for
$\beta$: Different vectors $\beta_1 \neq \beta_2$ can lead to $X\beta_1 = X\beta_2$, so it does not make sense to determine
the bias of an estimator for $\beta$. Instead, we only show the figures for $\hat y_{PLS}$ and $\hat y_{BOUND}$.

Figure 7. Third example: Comparison of PLS and BOUND (stnr = 1)

Figure 8. Third example: Comparison of PLS and BOUND (stnr = 4)

Figure 9. Third example: Comparison of PLS and BOUND (stnr = 16)

Again, the estimated MSE of BOUND is lower than the estimated MSE of PLS. 

Fourth example. This example is taken from [Jj. A survey investigated the degree of job
satisfaction of the employees of a company. The employees filled in a questionnaire that consisted
of p = 26 questions regarding their work environment and one question (the response variable)
regarding the degree to which they are satisfied with their job. The answers of the employees
were summarized for each of the n = 34 departments of the company.

We compare the two methods PLS and BOUND on this data set. For each m = 1, ..., 26 we
determine the 10-fold cross-validation error.

Figure 10. Left: 10-fold cross-validation error. Right: 10-fold cross-validation
error for the first 6 components

The method BOUND is slightly better than PLS on this data set: The cv error for the
optimal number of components (which is $m_{opt} = 2$) is 0.2698 for BOUND and 0.2747 for PLS.
It is remarkable that in this example the cv error of BOUND exceeds the cv error of PLS in
some cases. It is not clear if this is due to the small number of examples (which makes the
estimation imprecise) or if this can also happen "in theory".

9. Conclusion 

This paper consists of two parts. In the first part, we gave alternative and hopefully clearer
proofs of the shape of the shrinkage factors of PLS. In particular, we derived the fact that some of the
shrinkage factors are > 1. We explained in detail that this would lead to an unnecessarily high MSE
if PLS was a linear estimator. This is however not the case and we emphasized that
bounding the absolute value of the shrinkage factors by 1 does not automatically lead to a lower MSE.

In the second part, we investigated the problem numerically. Experiments on simulated and
real world data showed that it might be better to adjust the shrinkage factors so that their
absolute value is $\le 1$ - a method that we called BOUND. The difference between BOUND and
PLS was not dramatic however. Besides, the scale of the experiments was of course far too
small, so it would be rash to conclude that we should always use BOUND instead
of PLS.

Nevertheless, the experiments show that it is worth exploring the method BOUND in more
detail. One drawback of this method is that we have to adjust the shrinkage factors "by hand".
If bounding the shrinkage factors tends to lead to better results, we might modify the original
optimization problem of PLS such that the shrinkage factors of the solution are bounded. We
might modify A and b to obtain a different Krylov space or replace $\mathcal{K}^{(m)}$ by a different set of
feasible solutions.